Processors should support integer math instructions that optionally trap on overflow. Because popular architectures lack this feature, otherwise excellent modern systems programming languages, such as Rust, Go, and D, have default integer types that wrap. This is bad because unexpected wrapping causes programs to produce incorrect results, although of course integer overflow in a safe language is not the utter disaster that it is in C/C++, where integer overflow and type unsafety cooperate to create the worst kinds of bugs. The reason that Rust provides wrapping integers is that so far, the costs of a better semantics — both in terms of runtime overhead and in terms of implementation effort for the Rust team — exceed the perceived benefits. My belief is that hardware traps could change this inequation in a favorable way.
The lack of trapping math instructions doesn’t only hurt low-level languages. In JavaScript, where all numbers are semantically double-precision floats, replacing doubles with integer types in the runtime requires expensive software overflow checks. I’ve heard that this results in a 5-10% slowdown for basically all JavaScript code. In languages such as Python and Racket the default integer type overflows into bignums instead of doubles; Matthew Flatt tells me that Racket’s performance penalty due to software overflow checks probably exceeds 100% for jitted tight loops that do lots of integer math.
You might be saying to yourself: But I require wrapping integers to implement crypto codes and PRNGs and hash functions and stuff like that. The answer is easy: we provide wrapping operators that can be used in these specialized situations. One choice would be +., -., etc. In a unicode language we might use ⊞, ⊟, etc. (Jesse Ruderman suggested this, and I like the idea).
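As a rough illustration of how the two flavors might desugar (this is only a sketch: the helper names are invented, and GCC/Clang’s __builtin_add_overflow stands in for whatever the compiler would actually emit):

#include <stdint.h>
#include <stdlib.h>

/* Default "+": trap on signed overflow (abort() stands in for a hardware trap). */
static int32_t checked_add(int32_t a, int32_t b) {
    int32_t result;
    if (__builtin_add_overflow(a, b, &result))   /* GCC/Clang extension */
        abort();
    return result;
}

/* Explicit wrapping "+.": two's-complement wraparound, for hashes, PRNGs,
   crypto code, and the like. */
static int32_t wrapping_add(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);   /* unsigned add wraps; cast back */
}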
Architectures such as MIPS and Alpha support integer overflow traps. However, to a good approximation, the only architectures that matter right now are ARM’s and Intel’s. There are two issues in adding integer overflow traps to these ISAs. First, where do we get opcode space for the new trapping instructions? For x86 and x86-64, which support an elaborate system of instruction prefixes, it may make sense to use that mechanism, although this comes with a code size penalty. ARM has fixed-size instructions so finding space may be trickier there. A mail to a friend at ARM on this topic has so far gone unanswered, but I am guessing that this could be shoehorned into ARMv9 somehow. The second issue is the effect of integer overflow traps on chip area and critical path length. An experienced architect who I talked to doesn’t think these are serious problems, and in any case the complexity is dramatically lower than the complexity of implementing something like hardware support for array bounds checking.
This post isn’t as much of an opinion piece as a plea to the folks at ARM, Intel, and AMD: Please provide this feature. It is needed in order to make high-level languages faster and low-level languages saner.
UPDATE: Some data about the overhead of software integer overflow checking in C/C++ codes can be found in Section IIID of this paper. Please keep in mind, however, that this implementation is tuned for debugging, not performance. A highly tuned software integer undefined behavior checker for C/C++ could probably have overhead in the 5% range.
38 responses to “We Need Hardware Traps for Integer Overflow”
How many problem domains require wrapping signed integers or non-wrapping unsigned integers? Can the current undefined case be replaced with a trap and everything else be left intact?
Another alternative to providing other operators would be to allow the existing types to trap and then provide a set of types that specifically represent Galois fields of order 2ᵏ.
From an implementation standpoint, it might be possible to leave the existing op-codes as is and allow code to enable/disable a trap-on-overflow feature. Or maybe add a “sticky” overflow accumulator bit plus clear and trap-if-set instructions (after all, you don’t in general need to know exactly which instruction in an expression overflowed, just that at least one did).
Hi bcs, we did quite a bit of manual disambiguation of intentional vs. unintentional wrapping while working on integer overflow detection for clang. It is hard work. My belief is that intentional wrapping is sufficiently rare that requiring people to use a separate operator wouldn’t be a big burden and would serve as an excellent form of documentation.
I’ve grown to dislike unsigned types but agree that they are a perfectly valid design point in a language that avoids the serious problems with implicit coercion that C/C++ have. Rust gets unsigned right, for example.
A conforming C/C++ implementation can trap when signed overflow occurs, and in fact this would be an excellent thing if it could be done with sufficiently low overhead that people could use it for production code.
I hadn’t considered making overflow traps into a processor mode. That would work well in the case where wrapping behavior is very rare. Regarding the sticky overflow bit, AIR integers basically implement this in software:
http://resources.sei.cmu.edu/asset_files/TechnicalNote/2010_004_001_15164.pdf
This same thing was done earlier in Ada. I think it is a good design point and according to the AIR paper it results in much better performance on current cores.
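A minimal C sketch of that sticky-flag scheme (the names are invented, __builtin_add_overflow is a GCC/Clang extension, and abort() stands in for a trap-if-set instruction):

#include <stdint.h>
#include <stdlib.h>

/* Sticky overflow bit: set by any overflowing operation, checked once
   at the end of an expression or statement. */
static int overflow_sticky = 0;

static int32_t add_sticky(int32_t a, int32_t b) {
    int32_t r;
    overflow_sticky |= __builtin_add_overflow(a, b, &r);  /* GCC/Clang extension */
    return r;   /* wrapped result; only trustworthy if the flag stays clear */
}

static void trap_if_overflow(void) {
    if (overflow_sticky)
        abort();            /* stands in for the trap-if-set instruction */
    overflow_sticky = 0;    /* clear for the next expression */
}

/* Usage: three additions share a single check instead of three. */
int32_t sum3(int32_t a, int32_t b, int32_t c) {
    int32_t s = add_sticky(add_sticky(a, b), c);
    trap_if_overflow();
    return s;
}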
Perhaps it would be enough if processors were optimized to handle code where an add instruction is followed by a branch-on-overflow (or trap-on-overflow) instruction. (My guess off the top of my head is that right now they are not optimized for this case, as it is probably not very common in benchmarks; for example, only C and C++ are used in CINT2006.)
However, the most useful feature for a developer would perhaps be a combination of the m68k CHK2 and TRAPV instruction, as this would allow you to support traps when a ranged integer is outside bounds. (Sorry, I’m not familiar enough with x86 off the top of my head to mention any equivalent instructions.) It would be interesting if an Ada program with lots of ranged integers was a part of SPEC CPU2006 or a similar well known benchmark…
When something like that isn’t widely implemented, it makes me wonder if there is a patent or two in there preventing widespread adoption.
Dan, that would be awful.
Andreas, I should have discussed the form of the instruction in this post. Yes, CHK2 is awesome. Something like this is needed in order to support Racket, Haskell, and other languages that reserve the high bits of integers for their own use. In Haskell an integer only has to go up to 2^29-1.
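For concreteness, a hypothetical C version of a CHK2-style check, with the bounds passed in rather than fixed at the machine word size (the 2^29 bound is just the Haskell example above):

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical CHK2-like helper: trap if v lies outside [lo, hi].
   A runtime that reserves high bits for tags would pass bounds such as
   lo = -(1 << 29) and hi = (1 << 29) - 1 rather than INT32_MIN/INT32_MAX. */
static int32_t check_range(int32_t v, int32_t lo, int32_t hi) {
    if (v < lo || v > hi)
        abort();   /* stands in for the trap */
    return v;
}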
I grew up with ranged integers in Pascal and still miss their absence in the languages I use now.
I have a feeling that on ARM (at least looking at the M3), something like one of the xPSR registers might be a more appropriate place to hold a "trap on overflow" setting. All of them have (1) available bits, (2) several currently-reserved bits in common, and (3) the property that they are saved/restored on interrupts/thread switches. Given that, and some indicator flags to tell you that an overflow has happened, it doesn’t sound so outlandish to me. ARM has already implemented overflow detection for saturation instructions, so they shouldn’t need to change much. This kind of approach would also avoid any instruction set changes.
What about the x86 INTO instruction? Shouldn’t that go a long way toward enabling traps on overflow?
Any processor worth its weight in gold has it already!
E.g.: TRAPcc in 680x0
And any compiler worth its bits already does not use them, since it’s much faster to branch than to trap! So they just test for overflow and branch out to bignum promotion subroutines, instead of trapping, handling the interrupt, identifying the problem, and correcting it.
Hi Informatimago, you are correct that branching is faster than trapping, but you are overlooking the fact that not trapping is faster than branching. In the expected case where overflow does not occur and we do not need to promote to bignum, math can proceed at native speed instead of taking the ~2x performance hit that is caused by putting a branch after every instruction that might overflow.
For C/C++, please see Section IIId of our paper about integer overflow checking, which contains some performance numbers:
http://www.cs.utah.edu/~regehr/papers/overflow12.pdf
If you just want the quick version, mean slowdown of SPEC CINT 2006 is 44%. Of course this could be improved but the overhead of these branches is nontrivial. We’re working on an expanded version of this paper that has a lot more performance data.
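To make the branch-vs-trap tradeoff concrete, here is roughly what the fast-path/slow-path pattern looks like in C for a runtime with fixnums and bignums (all names are invented; __builtin_add_overflow is a GCC/Clang extension):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical slow path; a real runtime would allocate a bignum here. */
static int64_t add_slow_bignum(int64_t a, int64_t b) {
    fprintf(stderr, "promoting %lld + %lld to bignum\n", (long long)a, (long long)b);
    exit(1);
}

/* Fast path: native add, branching out to the slow path only on overflow.
   The branch is almost always predicted not-taken, but it is still an extra
   instruction after every add that might overflow. */
static int64_t fixnum_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))   /* GCC/Clang extension */
        return add_slow_bignum(a, b);       /* rare slow path */
    return r;
}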
Hi me, I always thought there was something wrong with INTO but can’t remember where I heard that. Anyway I believe it’s gone in x86-64. Also it would be great to have a more flexible checker that takes bounds as arguments.
As someone who used to program in Modula-2, and whose company’s main product was originally written almost entirely in Modula-2 and Oberon-2, I have never experienced nor witnessed a performance problem that could be attributed to software integer overflow checks, which are enabled by default in these languages. Moreover, in Modula-2 you can define subrange types with checks on assignment/parameter passing:
TYPE
  MyRange = [1..42];
  CapitalLetters = CHAR['A'..'Z'];
Yes, when performance (or size) of a piece of code is absolutely critical, you’d have to disable such checks on that specific piece. But that is less than 1% of all code, and you would thoroughly test and review such critical pieces in any case, right?
P.S. Regarding 1%, okay, I made up that number, but I think you get the point.
Not sure if you’re aware of this, but on ARM there are the qadd and qsub instructions which do saturating add and subtract respectively. So it should be no problem making a programming language that saturates by default. Multiplies still require a flags check, but that doesn’t seem terribly unreasonable.
As others have mentioned, getting saturation on x86(_64) chips is a whole different story..
Thanks a lot for this call, I hope that you will be heard.
What is really infuriating is that the implementation cost of a trap is nearly zero when you consider the complexity of a modern CPU, and it would bring much improved integer semantics, yet here we are in the 21st century and we still don’t have it (MIPS alone isn’t enough).
The hardest part is finding room for a "trap bit" in the instruction encoding on RISCs, but I think that a trap bit on register-only instructions (no immediates) would be a "good enough" start.
Yeah, the availability of INTO gets in the way big time… examining the code generated by clang intrinsics for overflow checks suggests that we’d need some form of scoped mechanism for better code generation (i.e., only check the overflow/carry flags before they could conceivably get cleared, and have one handler per scope). That ought not to be too hard to implement, for a substantial improvement in code quality…
Hi Mark, I knew that ARM had some saturation support but didn’t know any details. I’m not sure that’s the right answer, though.
Instead of a wrapping version of each relevant operator, you could also have the programmer simply impose a truncation on the result, and in the compiler recognize that the combined operation is a single instruction in the processor. Like: trunc32(a + b)
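Something like this, as a C sketch of the truncation idea (a compiler can fold the widening add plus truncation into a single 32-bit wrapping add):

#include <stdint.h>

/* trunc32(a + b): do the math in a wider type, then truncate explicitly.
   The truncating cast is implementation-defined in older C standards but
   is a plain two's-complement wrap on mainstream compilers. */
static int32_t trunc32_add(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    return (int32_t)wide;                     /* explicit, visible truncation */
}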
I like the idea of it being a processor mode. I’ve had good luck with using this technique to track down floating-point problems. I use a class to enable/disable some FPU exceptions (divide-by-zero, overflow, and illegal operation) and try to only have them suppressed for those rare pieces of code that require them. In practice that also means suppressing them for sloppy code that it’s not worth fixing. This model could work for integer overflow.
For backwards compatibility the integer overflow traps would have to be off by default, but smart programmers would enable them aggressively.
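For reference, the floating-point version of this technique looks roughly like the following in C; feenableexcept/fedisableexcept are glibc extensions, so the exact calls differ on other platforms:

#define _GNU_SOURCE
#include <fenv.h>

/* Turn on traps for the "serious" FP exceptions, analogous to a
   trap-on-integer-overflow processor mode (glibc-specific). */
static void enable_fp_traps(void) {
    feenableexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID);
}

/* Suppress them around the rare code that legitimately needs
   non-trapping behavior, then re-enable afterwards. */
static void disable_fp_traps(void) {
    fedisableexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID);
}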
I liked Informatimago’s comment and John’s response: “branching is faster than trapping, but not trapping is faster than branching”. However there is another piece of the design space, “not branching” — which you can do if your compiler can infer more things about ranges.
Trap-on-overflow is most helpful for systems with basic compilers. Systems with better compilers don’t need it as much.
Hi Andy, there’s a ton of optimization work that can be done, but on the other hand my belief is that in some of our large security-critical programs like web browsers and OS kernels, there will be plenty of checks that cannot be killed at compile time, so we still want the runtime checks to be efficient.
I’m not sure the benefit of trapping instead of just branching isn’t overrated. For one, in the paper you referenced, the measured overhead includes the cost of the checks actually triggering and being handled. I would claim that this therefore does not meaningfully reflect what you *really* want to measure, namely overflow checks as safeguards in well-behaved programs where they do not actually trigger (after all, the intent would be to treat overflow as an "exceptional event" and disrupt control flow anyway).
FWIW I tried the most stupid approach, measuring the run time of two loops performing either no check or a gratuitous overflow check:
int fn_checked(int x) {
    while (x < 0x40000000) {
        __asm__(
            "incl %0\n"        /* increment; sets OF on signed overflow */
            "jo 1f\n"          /* branch to the trap only if OF is set */
            ".subsection 2\n"  /* keep the trap out of the hot code path */
            "1:\n"
            "int3\n"           /* breakpoint trap */
            ".previous\n"
            : "=r"(x) : "0"(x));
    }
    return x;
}
int fn_unchecked(int x) {
    while (x < 0x40000000) {
        __asm__(
            "incl %0\n"        /* plain increment, no overflow check */
            : "=r"(x) : "0"(x));
    }
    return x;
}
I am unable to measure any time difference between the two whatsoever, and I think it is already an exaggerated example. (Yes it is artificial in that the branch predictor just cannot possibly get it wrong in such a tight loop).
The proprietary Mill CPU architecture (http://millcomputing.com/docs/) has support for four overflow modes: Excepting, Modulo, Saturating, and Widening. Really hope The Mill takes off, fascinating series of talks about their architecture and toolchain.
Here are two more reasons to lean towards doing it with branches rather than with a hardware trap.
One is that at the hardware level, trapping might be just as hard as handling branches. An instruction that can trap is still going to either stall the pipeline or use up some of the limited transistors dedicated to speculative evaluation. It’s even worse if it’s a CPU mode; the CPU has to basically support two instructions, but it doesn’t know which one it will be until it gets there.
That leads to the second issue: the right comparison is not Intel adding a trapping version of arithmetic versus Intel doing nothing. The comparison should be Intel adding hardware trapping versus Intel supporting this new branching pattern.
@Lex Spoon
> One is that at the hardware level, trapping might be just as hard as handling branches.
For the CPU? Maybe, but you fail to take into account the instruction cache and memory bandwidth that the branch instructions use, whereas traps use much less (though they may not be free, due to the increase in instruction size).
> [cut] versus Intel supporting this new branching pattern.
“new” branching pattern? What’s new about it? It’s just branches predicted as not taken..
I don’t recommend using a CPU mode.
I doubt there are any patents on any of this; older architectures had the TRAPcc instructions, I’d be stunned if there was not at least one older architecture that would trap on the instruction itself (recall that the C standard, which allows exactly that, acquired its squirrelly definition so that such processors could support a standard-conforming C implementation without any compromises in efficiency).
One issue with branches vs. traps has to do with hardware resources: branch predictors have finite resources, so there is a risk that doubling the number of branches in a typical flow path will compromise prediction efficiency. There are some hacks, certainly obvious to me, such as using compiled-in prediction bits on architectures that support this, and not wasting branch prediction resources on any branch that matches its prediction. Note that in the limit, the sign of a PC-relative branch displacement can be used as a prediction bit, where backwards is assumed taken and forwards is assumed not-taken; this is also old, obvious knowledge, in some cases used by compiler writers on the assumption that hardware designers would make the obvious intelligent choices.
It seems like Swift does exactly what you propose:
https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/AdvancedOperators.html#//apple_ref/doc/uid/TP40014097-CH27-XID_37
LLVM can do the hard/portable/optimization work: http://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics
Hi LLVM-fan, definitely, though there’s plenty of work left to do in making LLVM passes aware of these intrinsics and removing them when possible.
lxgr, I was pretty psyched to see Swift’s integers, which look just about right!
One issue with this is that it does prevent reordering additions by the compiler, because once you detect overflow, addition is no longer associative. This would also prevent SIMD-ising sums, so in extreme cases you could lose a factor of 4 or 8. So even the option of a sticky overflow flag, which would seem like it would have zero performance impact, in fact can.
Hi Alex B, this problem can be worked around by specifying that traps can be delayed, for example, until the end of evaluating an expression. Ada did this.
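A rough C illustration of the point (function names invented; __builtin_add_overflow and __builtin_trap are GCC/Clang builtins): the unchecked sum can be reassociated and vectorized, while the checked sum cannot, unless the language allows the trap to fire late:

#include <stdint.h>

/* Without checks, the compiler is free to reorder and vectorize this sum
   (e.g., 4 or 8 lanes wide with SSE/AVX). */
int32_t sum_unchecked(const int32_t *a, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* With a per-addition overflow check, each add depends on the exact previous
   partial sum, so reassociation and SIMD are ruled out unless traps may be
   delayed to the end of the loop. */
int32_t sum_checked(const int32_t *a, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        if (__builtin_add_overflow(s, a[i], &s))  /* GCC/Clang extension */
            __builtin_trap();
    return s;
}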
Rather than inventing a separate set of operators, just use the existing operators on wraparound types. If you need both kinds of operations on the same values, use a type conversion.
Ada does this: “+” on a modular type wraps around, while “+” on a signed integer type raises an exception on overflow.
C *almost* does this as well: “+” on an unsigned type wraps around, while an overflowing “+” on a signed integer type has undefined behavior, which could easily be replaced by a trap (or, in C++, an exception).
Keith, yeah, that’s a viable design if the nasty implicit conversions between signed and unsigned were eliminated:
http://blog.regehr.org/archives/268
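A tiny C example of Keith’s point; -ftrapv or -fsanitize=signed-integer-overflow turns the undefined case into a runtime trap:

#include <stdint.h>

/* "+" on an unsigned type wraps around by definition. */
uint32_t wraps(uint32_t x) { return x + 1u; }    /* 0xFFFFFFFF + 1 == 0, well defined */

/* An overflowing "+" on a signed type is undefined behavior, so an
   implementation is free to trap instead. */
int32_t may_trap(int32_t x) { return x + 1; }    /* INT32_MAX + 1: undefined */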
Those who do not understand the hardware are doomed to blog about it, poorly.
Intel chips already detect overflow; that is what the OF flag (bit 11) in the flags/eflags register indicates: that the most recent operation overflowed.
Testing, and generating a trap, for an overflow is a single instruction (JO – jump if overflow and JNO – jump if no overflow).
This is true of almost all CPUs. At the hardware/assembly programming level, overflow detection is handled by hardware and detected by a single instruction.
The reason none of the languages you listed have any sort of trap is simple: the language designers did not bother to check the result of their operations. Why? Most likely because almost all of them are ultimately implemented in C, and C does not reflect the overflow status back up to the C language level.
So the problem isn’t the hardware. It is the software architectural designs implemented above the hardware. They do not bother to communicate to you the results that the hardware already has available and tells them about.
Hi Anon, using JO is obvious. My post is about getting rid of all of the JO instructions. This has nothing to do with C.
John,
Interesting post. It seems to me that a bunch of JO instructions that surely predicted “not taken” are not going to in themselves have much effect on performance. I can see two reasons why they might matter: (1) they could confuse later compiler passes and prevent optimizations (it seems from my cursory scan of your paper that that is what is going on) or (2) they could put more pressure on the branch predictor hardware by making it keep track of more branches. Problem (1) doesn’t require any hardware modification, it “just” requires smarter compilers. If problem (2) is real, it seems like something that can be handled by better management of branch prediction entries, without any ISA changes.
Hi Andrew, good to see you here!
There is a hole in your argument (which is the same argument made by several others) that nobody has patched so far.
First of all, I agree that JO instructions are cheap when they can be predicted correctly. Second, would you agree that regular ALU instructions like ADD and SUB are also cheap? If so, then I think you need to present an argument explaining why doubling the number of cheap instructions does not have much effect on performance.
Regarding the effect of overflow checks on optimizations, this is very real. It can be worked around by weakening the language semantics a bit, for example by permitting traps to fire a bit late, say at the end of an expression, or perhaps only before the overflowed value influences a side effect.
Here’s another argument. Did you consider the cost of adding such a feature? Anything that potentially deals with control flow has the potential to increase branch misprediction penalties, hurt clock speed, hurt existing architectural optimizations, destroy future optimization potential, and increase instruction latencies. It’s about having multiple mechanisms for the same thing. And then consider that a 5% increase is not a lot, and it’s not ALL the workloads that use that CPU, so the cost needs to be minuscule to be reasonable. Actually, the change may be large enough that the opportunity cost alone is bigger than the benefit.
Intel’s latest-generation CPUs added a second branch unit, so the software cost of adding a not-taken branch after each addition could be mostly gone compared to previous generations. (Not measured, since I don’t have such hardware.)
What I consider the most likely method of implementation could result in only being able to decode ONE such trapping instruction per cycle on Intel designs, while the old method on new processors has the potential to do two per cycle, unlike processors over a year old, which could do one per cycle at most.
The only thing I’d ask Intel to consider in hardware is changing one of the reserved flag bits into a persistent (sticky) OF/CF, and adding a branch instruction keyed to it.
But in reality I have no idea of its relative cost, since I don’t know the details of the flag-handling hardware. The cost may be totally insignificant or large, because it reuses existing flag hardware on the same execution unit, in the quantities it normally operates with.
And maybe the most general and cost-effective way of dealing with this is speeding up execution of the existing instruction sequences that do it.
Interesting proposal, though I disagree with it for vague reasons. My feeling is that the added cost of implementing a new trap path is not worth it, but I admit this is just speculation as I don’t know the specifics of the costs involved. I tend to think we can come close to this in software with a smarter -ftrapv that only emits traps when the compiler cannot prove that the operation will not overflow.
With regards to finding room in the opcode space, any sensibly designed ISA has such room to enable extensions. I’m not an expert in this area, but both ARM and x86 have such deliberate holes.
@regehr: I think that you are overemphasizing the cost of the JO instructions. What matters most from the execution point of view is the “dependence graph” of operations. Your remark of “turning a single simple operation into two simple operations cannot be free of costs” does not view the problem in these terms (case in point: two data-dependent adds are slower than two data-independent adds).
If the CPU speculates the branch as not taken (and possibly additional heuristics and/or compiler assist may help here), then the data dependence graph looks the same from the processor’s POV, and that should be sufficient to recover the main possible performance cost easily.
The remaining concern would be increased icache pressure, BTB pressure etc. This is valid, but I am not sure any of the numbers presented so far actually measure this.
The other thing to be noted is that the program parts that would suffer most from being penalized by additional checks (tight inner loops) are also those where compilers tend to have the best chance of analyzing the overflow checks away. The incapacity of compilers to handle such checks might still be a reason to solve the problem in silicon, but I am not sure the data so far is conclusive. gcc, for example, is doing fine using this approach to handle, e.g., Ada, which requires such runtime checks.